Field of the Invention 



This invention relates generally to digital signal representation, and more particularly 
to an apparatus and method to compress a video signal for improved source quality 
versus decoding complexity for a given compression bit rate and improved resistance 
against data loss when delivered over an error-prone network. Further, the invention 
provides an apparatus and method to improve the delivery quality of digital multimedia 
streams over a lossy packet network. 



Background of the Invention 



The purpose of source coding (or compression) is data rate reduction. For example, 
the data rate of an uncompressed NTSC (National Television Systems Committee) 
TV-resolution video stream is close to 170 Mbps, which corresponds to less than 30 seconds 
of recording time on a regular compact disk (CD). The choice of a compression standard 
depends primarily on the available transmission or storage capacity as well as the features 
required by the application. The most often cited video standards are H.263, H.261, 
MPEG-1 and MPEG-2 (Moving Picture Experts Group). The aforementioned video 
compression standards are based on the techniques of discrete cosine transform (DCT) 
and motion prediction, even though each standard targets a different application (i.e., 
different encoding rates and qualities). The applications range from desktop 
video-conferencing to TV channel broadcasts over satellite, cable, and other broadcast channels. 
The former typically uses H.261 or H.263 while MPEG-2 is the most appropriate 
compression standard for the video broadcast applications. 

Motion prediction operates to efficiently reduce the temporal redundancy inherent to 
most video signals. The resulting predictive structure of the signal, however, makes it 
vulnerable to data loss when delivered over an error-prone network. Indeed, when data 
loss occurs in a reference picture, the lost video areas will affect the predicted video areas 
in subsequent frame(s), in an effect known as temporal propagation. 

Three-dimensional (3-D) transforms offer an alternative to motion prediction. In this 
case, temporal redundancy is reduced the way spatial redundancy is; that is, using a 
mathematical transform for the third dimension (e.g., wavelets, DCT). Algorithms based 
on 3-D transforms have proven to be as efficient as coding standards such as MPEG-2, 
and comparable in coding efficiency to H.263. In addition, error resilience is improved 
since compressed 3-D blocks are self-decodable. 

Non-orthogonal transforms have several properties that make them an interesting 
alternative to orthogonal transforms such as the DCT or wavelets. Decomposing a signal 
over a redundant dictionary improves the compression efficiency, especially at low bit 
rates where most of the signal energy is captured by few elements. Moreover, video signals 
resulting from decomposition over a redundant dictionary are more resistant to data loss. 
The main limitation of non-orthogonal transforms is encoding complexity. 

Matching pursuit algorithms provide a way to iteratively decompose a signal into its 
most important features with limited complexity. The matching pursuit algorithm will 
output a stream composed of both atom parameters and their respective coefficients. The 
problem with the state of the art in matching pursuit is that the dictionaries address 
neither the need for decomposition along both the spatial and temporal domains nor the 
optimization of source coding quality versus decoding complexity for a given bit rate. 

The art in Matching Pursuit (MP) coding is limited. A publication by S. G. Mallat 
and Z. Zhang, entitled "Matching Pursuits With Time-Frequency Dictionaries", IEEE 
Transactions on Signal Processing, Vol. 41, No. 12, December 1993, details one 
application of matching pursuit coding. In addition, the publication entitled "Very Low 
Bit-Rate Video Coding Based on Matching Pursuits", by R. Neff and A. Zakhor, Circuits 
and Systems for Video Technology, Vol. 7, No. 1, February 1997, the publication 
entitled "Decoder Complexity and Performance Comparison of Matching Pursuit and 
DCT-Based MPEG-4 Video Codecs", by R. Neff, T. Nomura and A. Zakhor, Circuits and 
Systems for Video Technology, Vol. 7, No. 1, February 1997, and U.S. Patent No. 
5,699,121 detail using a 2-D matching pursuit coder to compress the residual prediction 
error resulting from motion prediction. 

The shortcomings of the prior art include, first, that matching pursuit has never been 
proposed for coding 3-D signals. Second, the basis functions have been limited to Gabor 
functions because they were proven to minimize the time-frequency uncertainty product. 
However, these functions are generally isotropic (same scale along the x- and y-axes) and 
do not address image characteristics such as contours and textures. 

What is needed, therefore, is a system and method to represent a video signal for 
improved source quality versus decoding complexity for a given compression rate and 
improved resistance to data loss when delivered over an error-prone network. 



Summary of the Invention 



The foregoing and other objectives are realized by the present invention, which 
alleviates the problems related to hybrid DCT/motion prediction coding by using a 3-D 
matching pursuit algorithm. The invention defines a separable 3-D structured dictionary. 
The resulting representation of the input signal is highly resistant to data loss 
(non-orthogonal transforms). Also, it improves the source coding quality versus decoding 
requirements for a given target bit rate (anisotropy of the dictionary). The invention 
disclosed herein provides an apparatus and method to compress a video signal for 
improved source quality versus decoding complexity for a given compression rate and 
improved resistance against data loss when delivered over an error-prone network. The 
method relies on a 3-D matching pursuit algorithm with an improved 3-D dictionary. The 
3-D MP encoder transforms blocks of frames into a set of spatio-temporal functions from 
the improved dictionary. The 3-D coder outputs a video stream that is highly resistant to 
data loss. Also, the proposed dictionary is optimized for source quality versus decoding 
complexity for a given compression rate. The present invention mainly targets 
asymmetric applications (high-end PC for encoding and a wide range of devices for 
decoding). 



Brief Description of the Drawings 



The advantages of the present invention will become readily apparent to those 
ordinarily skilled in the art after reviewing the following detailed description and 
accompanying drawings, wherein: 

FIG. 1 is a block diagram illustrating the overall architecture in which the present 
invention takes place; 

FIG. 2 illustrates the Signal Transform Block 100 from FIG. 1; 

FIG. 3 is a flow graph illustrating the Matching Pursuit iterative algorithm of FIG. 2; 

FIG. 4 shows an example of a spatio-temporal dictionary function in accordance with 
the present invention; 

FIG. 5 shows an example of video signal reconstruction after 100 Matching Pursuit 
iterations; and 

FIG. 6 shows an example of video signal reconstruction after 500 Matching Pursuit 
iterations. 



Detailed Description of the Invention 



The present invention alleviates the problems related to hybrid DCT/motion 
prediction coding by using a 3-D matching pursuit algorithm. The invention defines a 
separable 3-D structured dictionary. The resulting representation of the input signal is 
highly resistant to data loss (non-orthogonal transforms). Also, it improves the source 
coding quality versus decoding requirements for a given target bit rate (anisotropy of the 
dictionary). 

Matching Pursuit (MP) is an adaptive algorithm that iteratively decomposes a 
function $f \in L^2(\mathbb{R})$ (e.g., image, video) over a possibly redundant dictionary of 
functions called atoms (see FIG. 3). Let $D = \{g_\gamma\}$ be such a dictionary with 
$\|g_\gamma\| = 1$. $f$ is first decomposed into: 

$$f = \langle g_{\gamma_0} | f \rangle \, g_{\gamma_0} + Rf,$$

where $\langle g_{\gamma_0} | f \rangle \, g_{\gamma_0}$ represents the projection of $f$ onto 
$g_{\gamma_0}$ and $Rf$ is the residual component. Since all elements in $D$ have a unit norm, 
$g_{\gamma_0}$ is orthogonal to $Rf$, and this leads to: 

$$\|f\|^2 = |\langle g_{\gamma_0} | f \rangle|^2 + \|Rf\|^2.$$

In order to minimize $\|Rf\|$ and thus optimize compression, one must choose $g_{\gamma_0}$ 
such that the projection coefficient $|\langle g_{\gamma_0} | f \rangle|$ is maximum. Applying 
the same strategy to the residual component carries the pursuit farther. After N iterations, 
one has the following decomposition for $f$: 

$$f = \sum_{n=0}^{N-1} \langle g_{\gamma_n} | R^n f \rangle \, g_{\gamma_n} + R^N f,$$

with $R^0 f = f$. Similarly, the energy $\|f\|^2$ is decomposed into: 

$$\|f\|^2 = \sum_{n=0}^{N-1} |\langle g_{\gamma_n} | R^n f \rangle|^2 + \|R^N f\|^2.$$


Although matching pursuit places very few restrictions on the dictionary set, the 
structure of the latter is strongly related to convergence speed and thus to coding 
efficiency. The decay of the residual energy $\|R^N f\|^2$ has indeed been shown to be 
upper-bounded by an exponential, whose parameters depend on the dictionary. However, true 
optimization of the dictionary can be very difficult. Any collection of arbitrarily sized and 
shaped functions can be used, as long as completeness is respected. 
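
By way of illustration only, the iterative decomposition and the energy relation described above can be sketched as follows. The sketch operates on a 1-D vector with a randomly generated, redundant dictionary of unit-norm atoms; the function name `matching_pursuit`, the dictionary, and the test signal are assumptions made for this example and are not part of the disclosed dictionary.

```python
import numpy as np

def matching_pursuit(f, atoms, n_iter):
    """Greedy matching pursuit.

    atoms is an (M, L) array whose rows are unit-norm atoms.
    Returns the elected atom indices, their coefficients, and the final residual.
    """
    residual = np.asarray(f, dtype=float).copy()    # R^0 f = f
    indices, coeffs = [], []
    for _ in range(n_iter):
        projections = atoms @ residual              # <g_gamma | R^n f> for every atom
        best = int(np.argmax(np.abs(projections)))  # atom with the largest |projection|
        c = projections[best]
        residual = residual - c * atoms[best]       # R^{n+1} f
        indices.append(best)
        coeffs.append(c)
    return np.array(indices), np.array(coeffs), residual

# Hypothetical redundant dictionary: 64 random unit-norm atoms in dimension 16.
rng = np.random.default_rng(0)
atoms = rng.standard_normal((64, 16))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

f = rng.standard_normal(16)
idx, c, r = matching_pursuit(f, atoms, 20)

energy_in = np.sum(f ** 2)                  # ||f||^2
energy_out = np.sum(c ** 2) + np.sum(r ** 2)  # sum of |c_n|^2 plus ||R^N f||^2
```

In this sketch `energy_in` and `energy_out` agree up to rounding, which mirrors the energy decomposition stated above, and the residual energy shrinks as more iterations are run.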

The method of the present invention is useful in a variety of applications where it is 
desired to produce a low to medium bit rate video stream to be delivered over an 
error-prone network and decoded by a set of heterogeneous devices. 

First, let the dictionary define the set of basis functions used for the signal 
representation. The basis functions are called atoms. Each atom is identified by a 
possibly multi-dimensional index $\gamma_n$, and the index together with a correlation 
coefficient $c_n$ forms an MP iteration. 



The method of the present invention then is as follows. As illustrated in FIG. 2, the 
original video signal $f$ is first passed to a Frame Buffer 101 to form groups of K video 
frames of dimension $X \times Y$. The method of the present invention thus decomposes the 
input video sequence into independent 3-D blocks that are K frames long. The dictionary 102 
is composed of atoms, which are also 3-D functions of the same size, i.e., 
$K \times X \times Y$. The method as shown in FIG. 3 iteratively compares the residual 3-D 
function with the dictionary atoms and elects in the Pattern Matcher 103 the 3-D atom that 
best matches the residual signal (i.e., the atom which best correlates with the residual 
signal). The parameters of the elected atom, which are the index $\gamma_n$ and the 
coefficient $c_n$, are sent to the following block performing the Coding (i.e., 
quantization and entropy coding, possibly followed by channel coding, as shown in FIG. 1). 
The pursuit continues up to a predefined number of iterations N, which is either imposed 
by the user, or deduced from a rate constraint and/or a source coding quality constraint. 
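
As a minimal sketch of the buffering step, the grouping of frames into independent K-frame blocks might look like the following. The array layout (frames first), the decision to drop trailing frames that do not fill a block, and the helper `iterations_for_budget` with its fixed cost per atom are all assumptions made for this example; the text above does not specify them.

```python
import numpy as np

def group_frames(video, K):
    """Split a (T, X, Y) video array into independent (K, X, Y) blocks.

    Trailing frames that do not fill a complete block are dropped
    (an assumption of this sketch).
    """
    n_blocks = video.shape[0] // K
    return video[:n_blocks * K].reshape(n_blocks, K, *video.shape[1:])

def iterations_for_budget(bit_budget, bits_per_atom):
    """Hypothetical rate rule: number of MP iterations N that fit a bit budget,
    assuming every coded atom costs the same number of bits."""
    return bit_budget // bits_per_atom

# Ten 4x6 frames grouped into blocks of K = 4 frames.
video = np.arange(10 * 4 * 6, dtype=float).reshape(10, 4, 6)
blocks = group_frames(video, 4)
N = iterations_for_budget(1000, 48)
```

Each entry of `blocks` can then be handed to the pursuit independently of the others, which is what keeps the compressed 3-D blocks self-decodable.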

The method relies on a structured 3-D dictionary 102, which allows for a good trade-off 
between dictionary size and compression efficiency. In our method, the dictionary is 
constructed from separable temporal and spatial functions, since the features to capture 
are different in the spatial and temporal domains. An atom of the dictionary is therefore 
written as $g_\gamma(x,y,k) = \frac{1}{W} \times S_{\gamma_s}(x,y) \times T_{\gamma_t}(k)$, 
where $\gamma$ corresponds to the parameters that transform the generating function. The 
parameter $W$ is chosen so that each atom is normalized, i.e., $\|g_\gamma(x,y,k)\|^2 = 1$. 
Each entry of the dictionary therefore consists of a series of 7 parameters. The first 5 
parameters specify position, dilation and rotation of the spatial function of the atom, 
$S_{\gamma_s}(x,y)$. The last 2 parameters specify the position and dilation of the temporal 
part of the atom, $T_{\gamma_t}(k)$. 
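
The separable construction and the role of the normalization parameter W can be illustrated with a short sketch. The function name `separable_atom` and the placeholder spatial and temporal samples used in the demonstration are assumptions for this example only; any sampled spatial function and temporal function could be substituted.

```python
import numpy as np

def separable_atom(S, T):
    """Build a unit-norm 3-D atom from spatial samples S (X, Y) and temporal samples T (K,).

    The result has shape (K, X, Y) and equals S(x, y) * T(k) / W.
    """
    g = T[:, None, None] * S[None, :, :]
    W = np.linalg.norm(g)   # for a separable atom this equals ||S|| * ||T||
    if W == 0:
        raise ValueError("atom has zero energy and cannot be normalized")
    return g / W

# Placeholder spatial function (a smooth bump) and temporal window, for illustration only.
x, y = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
S = np.exp(-((x - 8) ** 2 + (y - 8) ** 2) / 10.0)
T = np.zeros(8)
T[2:5] = 1.0

g = separable_atom(S, T)
```

Because the atom is separable, W never requires a 3-D computation: it is simply the product of the norms of the spatial and temporal parts.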

The spatial function in a preferred embodiment of the present invention is generated 
using B-splines, which provide the advantages of having a limited and calculable 
support, and optimize the trade-off between compression efficiency (i.e., source coding 
quality for a given target bit rate) and decoding requirements (i.e., CPU and memory 
requirements to decode the input bit stream). A B-spline of order n is given by: 

$$\beta^n(x) = \sum_{k=0}^{n+1} \frac{(-1)^k (n+1)}{(n+1-k)! \, k!} \left[ x + \frac{n+1}{2} - k \right]_+^n,$$

where $[y]_+^n$ represents the positive part of $y^n$. 
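
For illustration, a B-spline of order n can be evaluated directly from a truncated-power sum. The sketch below assumes the standard centered B-spline (symmetric about zero, with support of width n + 1); the function name `bspline` is hypothetical, and the centered convention is an assumption of this example.

```python
from math import comb, factorial

def bspline(n, x):
    """Centered B-spline of order n evaluated at x, via the truncated-power sum.

    Each term uses the positive part: (y)^n if y > 0, and 0 otherwise.
    """
    total = 0.0
    for k in range(n + 2):
        y = x - k + (n + 1) / 2
        if y > 0:
            total += (-1) ** k * comb(n + 1, k) * y ** n
    return total / factorial(n)

# Samples of the 3rd order (cubic) B-spline at integer positions.
samples = [bspline(3, t) for t in (-2, -1, 0, 1, 2)]
```

The support of the cubic B-spline is limited to the interval from -2 to 2, which is the limited and calculable support mentioned above.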

The 2-D B-spline is formed with a 3rd order B-spline in one direction, and its first 
derivative in the orthogonal direction to catch edges and contours. Rotation, translation 
and anisotropic dilation of the B-spline generate an overcomplete dictionary. The 
anisotropic refinement allows different dilations along the two orthogonal axes, in 
contrast to Gabor atoms. Thus the spatial dictionary optimizes the trade-off between 
coding quality and decoding complexity for a specified source rate. The spatial function 
of the 3-D atoms can be written as $S_{\gamma_s}(x,y) = S_x(x,y) \times S_y(x,y)$, with: 

$$S_x(x,y) = \beta^3\!\left( \frac{\cos(\varphi)(x - p_x) + \sin(\varphi)(y - p_y)}{d_x} \right),$$

$$S_y(x,y) = (\beta^3)'\!\left( \frac{\sin(\varphi)(x - p_x) - \cos(\varphi)(y - p_y)}{d_y} \right).$$

The index $\gamma_s$ is thus given by 5 parameters; these are two parameters to describe an 
atom's spatial position $(p_x, p_y)$, two parameters to describe the spatial dilation of the 
atom $(d_x, d_y)$, and the rotation parameter $\varphi$. 
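
A spatial atom controlled by the five parameters above might be sampled as in the following sketch. The closed-form cubic B-spline and its derivative are standard; the choice of which rotated axis carries the derivative, the sign convention of the rotated coordinates, and the function names are assumptions made for this example.

```python
import numpy as np

def cubic_bspline(t):
    """Centered cubic B-spline, nonzero only for |t| < 2."""
    a = np.abs(t)
    return np.where(a < 1, 2 / 3 - a ** 2 + a ** 3 / 2,
                    np.where(a < 2, (2 - a) ** 3 / 6, 0.0))

def cubic_bspline_deriv(t):
    """First derivative of the centered cubic B-spline (an odd function)."""
    a = np.abs(t)
    d = np.where(a < 1, -2 * a + 1.5 * a ** 2,
                 np.where(a < 2, -0.5 * (2 - a) ** 2, 0.0))
    return np.sign(t) * d

def spatial_atom(shape, px, py, dx, dy, phi):
    """Sample a rotated, translated, anisotropically dilated 2-D atom on an X-by-Y grid."""
    X, Y = shape
    x, y = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    u = (np.cos(phi) * (x - px) + np.sin(phi) * (y - py)) / dx   # smooth direction
    v = (np.sin(phi) * (x - px) - np.cos(phi) * (y - py)) / dy   # edge direction
    return cubic_bspline(u) * cubic_bspline_deriv(v)

# An elongated, edge-like atom: dilation 4 along one axis, 2 along the other.
S0 = spatial_atom((32, 32), px=16, py=16, dx=4.0, dy=2.0, phi=0.0)
S_rot = spatial_atom((32, 32), px=16, py=16, dx=4.0, dy=2.0, phi=np.pi / 2)
```

With unequal values of d_x and d_y the atom is stretched along a contour while remaining narrow across it, which is the anisotropy that isotropic Gabor atoms lack.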

The temporal function is designed to efficiently capture the redundancy between 
adjacent video frames. Therefore $T_{\gamma_t}(k)$ is a simple rectangular function: it is 
constant over a window of consecutive frames and zero elsewhere. 

The temporal index $\gamma_t$ is here given by 2 parameters; these are one parameter to 
describe the atom's temporal position $p_k$ and one parameter to describe the temporal 
dilation $d_k$. 
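
The rectangular temporal function can be sketched as below. The exact support convention is an assumption of this example: the window is taken to start at frame p_k and to last d_k frames. The function name `temporal_atom` is likewise hypothetical.

```python
import numpy as np

def temporal_atom(K, pk, dk):
    """Rectangular window over K frames: 1 for pk <= k < pk + dk, 0 elsewhere.

    The half-open support [pk, pk + dk) is an assumed convention.
    """
    k = np.arange(K)
    return ((k >= pk) & (k < pk + dk)).astype(float)

T = temporal_atom(8, 2, 4)

# A pixel that keeps the same value over frames 2..5 is captured by a single atom.
signal = np.array([0, 0, 5, 5, 5, 5, 0, 0], dtype=float)
g = T / np.linalg.norm(T)          # unit-norm temporal atom
c = g @ signal                     # correlation coefficient
residual = signal - c * g          # nothing is left after one iteration
```

This illustrates why a rectangular window suits temporal redundancy: content that is static over several adjacent frames costs one atom rather than one atom per frame.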

The range of the index parameters $(p_x, p_y, p_k, d_x, d_y, d_k, \varphi)$ is designed to 
cover the size of the input signal. The spatio-temporal positions allow the 3-D input signal 
to be covered completely, and the dilation values follow an exponential distribution up to 
the 3-D input signal size. The basis functions may however be trained on typical input 
signal sets to determine a minimal dictionary, trading off the compression efficiency. 
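
One possible way to enumerate the parameter ranges is sketched below. The base-2 spacing of the dilations, the number of rotation angles, the restriction of angles to the half-open interval from 0 to pi, and the function names are all assumptions of this example; the text above only states that positions cover the signal and that dilations are exponentially distributed up to the signal size.

```python
import math

def dyadic_dilations(size):
    """Exponentially spaced dilations 1, 2, 4, ... not exceeding size (base 2 assumed)."""
    d, out = 1, []
    while d <= size:
        out.append(d)
        d *= 2
    return out

def index_ranges(K, X, Y, n_angles=4):
    """Candidate values for each of the 7 atom parameters of a K x X x Y block."""
    return {
        "p_x": list(range(X)),
        "p_y": list(range(Y)),
        "p_k": list(range(K)),
        "d_x": dyadic_dilations(X),
        "d_y": dyadic_dilations(Y),
        "d_k": dyadic_dilations(K),
        "phi": [i * math.pi / n_angles for i in range(n_angles)],
    }

def dictionary_size(ranges):
    """Number of atoms when every parameter combination is allowed."""
    return math.prod(len(values) for values in ranges.values())

ranges = index_ranges(K=8, X=16, Y=16)
size = dictionary_size(ranges)
```

Even for a small block the full parameter grid is large, which is one reason a trained, reduced dictionary may be preferred at the cost of some compression efficiency.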




FIG. 1 is a block diagram illustrating the overall architecture in which the present 
invention takes place. The Signal Transform block 100 is the focus of this invention, at 
which the foregoing transformation takes place. After transformation, the digital signal is 
quantized 200, entropy coded 300 and packetized 400 for delivery over the error-prone 
network 500. A wide range of decoding devices is targeted, from a high-end PC 600 to 
PDAs 700 and wireless devices 800. 

FIG. 2 illustrates the Signal Transform Block 100. The video sequence is fed into a 
frame buffer 101, where a spatio-temporal signal is formed. This signal is iteratively 
compared to functions of a Pattern Library 102 through a Pattern Matcher 103. The 
parameters of the chosen atoms are then sent to the quantization block 200, and the 
corresponding features are subtracted from the input spatio-temporal signal. 

FIG. 3 is a flow chart illustrating the Matching Pursuit iterative algorithm of FIG. 2. 
The Residual signal 101, which consists of the input video signal at the beginning of the 
Pursuit, is compared to a library of functions and a Pattern Matcher 103 elects the best 
matching atom. The contribution of the chosen atom is removed from the residual signal 
104 to form the residual signal of the next iteration. 

The Pattern Matcher 303 basically comprises an iterative loop within the MP algorithm 
main loop, as shown in FIG. 3. The residual signal is compared with the functions of the 
dictionary by computing, pixel-wise, the correlation coefficient between the residual 
signal and the atom. The square of the correlation coefficient represents the energy of the 
atom (107). The atom with the highest energy (112) is considered as the atom that best 
matches the residual signal characteristics and is elected by the Pattern Matcher. The 
atom index and parameters are sent (118) to the Entropy Coder as shown in FIG. 2, 
and the residual signal is updated accordingly (104). To increase the speed of the 
encoding, the best atom search can be performed only on a well-chosen subset of the 
dictionary functions. However, such a method may result in a sub-optimal signal 
representation. 
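
The inner search loop and the effect of restricting it to a subset can be sketched as follows. The dictionary here is a small set of random unit-norm 3-D atoms used purely for illustration, and the function name `best_atom` and the way the candidate subset is chosen are assumptions of this example.

```python
import numpy as np

def best_atom(residual, atoms, candidates=None):
    """Elect the atom with the highest energy (squared correlation) among the candidates.

    atoms has shape (M, K, X, Y) with unit-norm entries; residual has shape (K, X, Y).
    Returns the atom index and its correlation coefficient.
    """
    if candidates is None:
        candidates = np.arange(atoms.shape[0])
    candidates = np.asarray(candidates)
    # Pixel-wise products summed over the whole 3-D block, one value per candidate.
    corr = np.tensordot(atoms[candidates], residual, axes=3)
    j = int(np.argmax(corr ** 2))
    return int(candidates[j]), float(corr[j])

# Hypothetical dictionary of 50 random unit-norm atoms of size 4 x 8 x 8.
rng = np.random.default_rng(1)
atoms = rng.standard_normal((50, 4, 8, 8))
atoms /= np.linalg.norm(atoms.reshape(50, -1), axis=1)[:, None, None, None]
residual = rng.standard_normal((4, 8, 8))

i_full, c_full = best_atom(residual, atoms)                       # full search
i_sub, c_sub = best_atom(residual, atoms, candidates=range(10))   # subset search
new_residual = residual - c_full * atoms[i_full]                  # residual update
```

The subset search evaluates fewer correlations but can never capture more energy per iteration than the full search, which is the sub-optimality noted above.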

FIG. 4 shows an example of a spatio-temporal dictionary function. FIG. 5 shows an 
example of video signal reconstruction after 100 Matching Pursuit iterations. FIG. 6 
shows an example of video signal reconstruction after 500 Matching Pursuit iterations. 
Clearly, the amount of signal information recovered increases with successive iterations. 

The invention has been detailed in terms of a preferred embodiment. One having skill 
in the art will recognize that modifications may be made without departing from the spirit 
and scope of the invention as set forth in the appended claims. 



